Misspelling Oblivious Word Embeddings
In this paper we present a method to learn word embeddings that are resilient
to misspellings. Existing word embeddings have limited applicability to
malformed texts, which contain a non-negligible amount of out-of-vocabulary
words. We propose a method combining FastText with subwords and a supervised
task of learning misspelling patterns. In our method, misspellings of each word
are embedded close to their correct variants. We train these embeddings on a
new dataset we are releasing publicly. Finally, we experimentally show the
advantages of this approach on both intrinsic and extrinsic NLP tasks using
public test sets.
Comment: 9 pages
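As a rough illustration of why subword modelling helps here: the toy sketch below embeds words as the mean of hashed character n-gram vectors, so a misspelling shares most n-grams with its correct form and the two vectors land close together even before any supervised misspelling objective. The hashing scheme, dimensions, and example words are illustrative assumptions, not the paper's actual model.

    # Toy sketch: fastText-style subword embeddings place misspellings near their
    # correct variants because the two words share most character n-grams.
    # Hash function, dimensions, and bucket count are illustrative, not the paper's.
    import numpy as np

    DIM, BUCKETS = 50, 2 ** 16
    table = np.random.default_rng(0).normal(size=(BUCKETS, DIM))  # shared subword vectors

    def ngrams(word, n=3):
        padded = f"<{word}>"
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    def embed(word):
        return table[[hash(g) % BUCKETS for g in ngrams(word)]].mean(axis=0)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # "misspeling" shares most trigrams with "misspelling", so the embeddings are close.
    print(cosine(embed("misspelling"), embed("misspeling")))   # high similarity
    print(cosine(embed("misspelling"), embed("unrelated")))    # much lower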
PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them
Open-domain Question Answering models which directly leverage question-answer
(QA) pairs, such as closed-book QA (CBQA) models and QA-pair retrievers, show
promise in terms of speed and memory compared to conventional models which
retrieve and read from text corpora. QA-pair retrievers also offer
interpretable answers, a high degree of control, and are trivial to update at
test time with new knowledge. However, these models lack the accuracy of
retrieve-and-read systems, as substantially less knowledge is covered by the
available QA-pairs relative to text corpora like Wikipedia. To facilitate
improved QA-pair models, we introduce Probably Asked Questions (PAQ), a very
large resource of 65M automatically-generated QA-pairs. We introduce a new
QA-pair retriever, RePAQ, to complement PAQ. We find that PAQ preempts and
caches test questions, enabling RePAQ to match the accuracy of recent
retrieve-and-read models, whilst being significantly faster. Using PAQ, we
train CBQA models which outperform comparable baselines by 5%, but trail RePAQ
by over 15%, indicating the effectiveness of explicit retrieval. RePAQ can be
configured for size (under 500MB) or speed (over 1K questions per second)
whilst retaining high accuracy. Lastly, we demonstrate RePAQ's strength at
selective QA, abstaining from answering when it is likely to be incorrect. This
enables RePAQ to "back off" to a more expensive state-of-the-art model, leading
to a combined system which is both more accurate and 2x faster than the
state-of-the-art model alone.
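A minimal sketch of the QA-pair retrieval idea with selective back-off follows; the encoder, cached QA pairs, confidence threshold, and fallback function are toy stand-ins, not RePAQ's actual components.

    # Toy sketch of QA-pair retrieval with selective back-off. The encoder,
    # cached QA pairs, and confidence threshold are stand-ins, not RePAQ's.
    import numpy as np

    qa_pairs = [("who wrote hamlet", "William Shakespeare"),
                ("what is the capital of france", "Paris")]

    rng, vocab = np.random.default_rng(0), {}

    def embed(text):
        # Bag-of-words over random word vectors, just to make the example run.
        vecs = [vocab.setdefault(w, rng.normal(size=64)) for w in text.lower().split()]
        v = np.mean(vecs, axis=0)
        return v / np.linalg.norm(v)

    index = np.stack([embed(q) for q, _ in qa_pairs])

    def expensive_reader(question):
        return "<fall back to a slower retrieve-and-read model>"

    def answer(question, threshold=0.8):
        scores = index @ embed(question)
        best = int(scores.argmax())
        if scores[best] >= threshold:
            return qa_pairs[best][1]       # cheap cached answer from the QA index
        return expensive_reader(question)  # abstain and back off

    print(answer("who wrote hamlet"))
    print(answer("who painted the mona lisa"))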
How Decoding Strategies Affect the Verifiability of Generated Text
Recent progress in pre-trained language models led to systems that are able
to generate text of an increasingly high quality. While several works have
investigated the fluency and grammatical correctness of such models, it is
still unclear to which extent the generated text is consistent with factual
world knowledge. Here, we go beyond fluency and also investigate the
verifiability of text generated by state-of-the-art pre-trained language
models. A generated sentence is verifiable if it can be corroborated or
disproved by Wikipedia, and we find that the verifiability of generated text
strongly depends on the decoding strategy. In particular, we discover a
tradeoff between factuality (i.e., the ability of generating Wikipedia
corroborated text) and repetitiveness. While decoding strategies such as top-k
and nucleus sampling lead to less repetitive generations, they also produce
less verifiable text. Based on these findings, we introduce a simple and
effective decoding strategy which, in comparison to previously used decoding
strategies, produces less repetitive and more verifiable text.
Comment: accepted at Findings of EMNLP 2020
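For reference, a minimal sketch of the two sampling strategies discussed above, top-k and nucleus (top-p) sampling, applied to a vector of next-token logits; the cutoff values and the random logits are illustrative.

    # Minimal sketch of top-k and nucleus (top-p) sampling over next-token logits.
    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def top_k_sample(logits, k=50):
        # Keep only the k most likely tokens, renormalize, then sample.
        probs = softmax(logits)
        keep = np.argsort(probs)[-k:]
        return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))

    def nucleus_sample(logits, p=0.9):
        # Keep the smallest set of tokens whose cumulative probability exceeds p.
        probs = softmax(logits)
        order = np.argsort(probs)[::-1]
        cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
        keep = order[:cutoff]
        return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))

    logits = rng.normal(size=1000)  # stand-in for a language model's output
    print(top_k_sample(logits), nucleus_sample(logits))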
GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration
Noticing the urgent need for tools enabling fast and user-friendly qualitative
analysis of the large-scale textual corpora used in modern NLP, we propose to
turn to mature and well-tested methods from the domain of Information Retrieval
(IR) - a research field with a long history of tackling TB-scale document
collections. We discuss how Pyserini - a widely used toolkit for reproducible
IR research - can be integrated with the Hugging Face ecosystem
of open-source AI libraries and artifacts. We leverage the existing
functionalities of both platforms while proposing novel features further
facilitating their integration. Our goal is to give NLP researchers tools that
will allow them to develop retrieval-based instrumentation for their data
analytics needs with ease and agility. We include a Jupyter Notebook-based walk
through the core interoperability features, available on GitHub at
https://github.com/huggingface/gaia. We then demonstrate how the ideas we
present can be operationalized to create a powerful tool for qualitative data
analysis in NLP. We present GAIA Search - a search engine built following
previously laid out principles, giving access to four popular large-scale text
collections. GAIA serves a dual purpose: it illustrates the potential of the
methodologies we discuss, and it works as a standalone qualitative analysis
tool that NLP researchers can leverage to understand datasets prior to using
them in training. GAIA is hosted live on Hugging Face Spaces -
https://huggingface.co/spaces/spacerini/gaia
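The sketch below shows the generic Hugging Face + Pyserini pattern the abstract describes: export a datasets corpus to Pyserini's JSON format, build a BM25 index with Pyserini's standard indexing CLI, and query it. The dataset name, paths, and query are placeholders; GAIA/Spacerini's own helpers may expose this differently.

    # Generic pattern: index a Hugging Face dataset with Pyserini and search it.
    # Dataset name, paths, and query are placeholders, not GAIA's own interface.
    import json, os
    from datasets import load_dataset

    os.makedirs("corpus", exist_ok=True)
    ds = load_dataset("imdb", split="train[:1000]")  # any text dataset works here
    with open("corpus/docs.jsonl", "w") as f:
        for i, row in enumerate(ds):
            f.write(json.dumps({"id": str(i), "contents": row["text"]}) + "\n")

    # Build a BM25 index with Pyserini's indexing CLI (run in a shell):
    #   python -m pyserini.index.lucene --collection JsonCollection \
    #     --input corpus --index indexes/demo \
    #     --generator DefaultLuceneDocumentGenerator --threads 4 --storeRaw

    from pyserini.search.lucene import LuceneSearcher
    searcher = LuceneSearcher("indexes/demo")
    for hit in searcher.search("a film about friendship", k=5):
        print(hit.docid, round(hit.score, 2))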
Scaling Data-Constrained Language Models
The current trend of scaling language models involves increasing both
parameter count and training dataset size. Extrapolating this trend suggests
that training dataset size may soon be limited by the amount of text data
available on the internet. Motivated by this limit, we investigate scaling
language models in data-constrained regimes. Specifically, we run a large set
of experiments varying the extent of data repetition and compute budget,
ranging up to 900 billion training tokens and 9 billion parameter models. We
find that with constrained data for a fixed compute budget, training with up to
4 epochs of repeated data yields negligible changes to loss compared to having
unique data. However, with more repetition, the value of adding compute
eventually decays to zero. We propose and empirically validate a scaling law
for compute optimality that accounts for the decreasing value of repeated
tokens and excess parameters. Finally, we experiment with approaches mitigating
data scarcity, including augmenting the training dataset with code data or
removing commonly used filters. Models and datasets from our 400 training runs
are freely available at https://github.com/huggingface/datablations.
Comment: 47 pages (9 main), 37 figures, 13 tables
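To make the "decreasing value of repeated tokens" concrete, the sketch below counts repeated epochs as effective unique data with a saturating-exponential discount; the functional form mirrors the paper's parameterization, but the constant used here is made up rather than the fitted value.

    # Illustration only: repeated epochs contribute "effective" unique tokens with
    # exponentially diminishing returns. R_STAR is a made-up constant, not the fit.
    import math

    R_STAR = 15.0

    def effective_tokens(unique_tokens, epochs):
        repetitions = epochs - 1
        return unique_tokens * (1 + R_STAR * (1 - math.exp(-repetitions / R_STAR)))

    unique = 100e9  # pretend 100B unique tokens are available
    for epochs in (1, 4, 16, 64):
        seen = unique * epochs
        eff = effective_tokens(unique, epochs)
        print(f"{epochs:>2} epochs: {seen / 1e9:6.0f}B tokens seen, ~{eff / 1e9:5.0f}B effective")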
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Large pre-trained language models have been shown to store factual knowledge
in their parameters, and achieve state-of-the-art results when fine-tuned on
downstream NLP tasks. However, their ability to access and precisely manipulate
knowledge is still limited, and hence on knowledge-intensive tasks, their
performance lags behind task-specific architectures. Additionally, providing
provenance for their decisions and updating their world knowledge remain open
research problems. Pre-trained models with a differentiable access mechanism to
explicit non-parametric memory can overcome this issue, but have so far been
only investigated for extractive downstream tasks. We explore a general-purpose
fine-tuning recipe for retrieval-augmented generation (RAG) -- models which
combine pre-trained parametric and non-parametric memory for language
generation. We introduce RAG models where the parametric memory is a
pre-trained seq2seq model and the non-parametric memory is a dense vector index
of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG
formulations, one which conditions on the same retrieved passages across the
whole generated sequence, and another which can use different passages per token. We
fine-tune and evaluate our models on a wide range of knowledge-intensive NLP
tasks and set the state-of-the-art on three open domain QA tasks, outperforming
parametric seq2seq models and task-specific retrieve-and-extract architectures.
For language generation tasks, we find that RAG models generate more specific,
diverse and factual language than a state-of-the-art parametric-only seq2seq
baseline.
Comment: Accepted at NeurIPS 2020
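A toy computation contrasting the two formulations mentioned above: RAG-Sequence marginalizes over retrieved passages once per output sequence, while RAG-Token marginalizes at every token position. The probabilities are made-up numbers standing in for the retriever's p(z|x) and the generator's per-token probabilities.

    # Toy contrast of the two RAG formulations; all probabilities are made up.
    import numpy as np

    p_doc = np.array([0.6, 0.4])            # retriever's p(z|x) for two passages
    p_tok = np.array([[0.9, 0.2, 0.8],      # generator's p(y_i | x, z=0, y_<i)
                      [0.1, 0.7, 0.6]])     # generator's p(y_i | x, z=1, y_<i)

    # RAG-Sequence: the same passage conditions the whole sequence, then marginalize.
    p_sequence = float((p_doc * p_tok.prod(axis=1)).sum())

    # RAG-Token: marginalize over passages separately at every token position.
    p_token = float((p_doc[:, None] * p_tok).sum(axis=0).prod())

    print(p_sequence, p_token)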
KILT: a Benchmark for Knowledge Intensive Language Tasks
Challenging problems such as open-domain question answering, fact checking,
slot filling and entity linking require access to large, external knowledge
sources. While some models do well on individual tasks, developing general
models is difficult as each task might require computationally expensive
indexing of custom knowledge sources, in addition to dedicated infrastructure.
To catalyze research on models that condition on specific information in large
textual resources, we present a benchmark for knowledge-intensive language
tasks (KILT). All tasks in KILT are grounded in the same snapshot of Wikipedia,
reducing engineering turnaround through the re-use of components, as well as
accelerating research into task-agnostic memory architectures. We test both
task-specific and general baselines, evaluating downstream performance in
addition to the ability of the models to provide provenance. We find that a
shared dense vector index coupled with a seq2seq model is a strong baseline,
outperforming more tailor-made approaches for fact checking, open-domain
question answering and dialogue, and yielding competitive results on entity
linking and slot filling, by generating disambiguated text. KILT data and code
are available at https://github.com/facebookresearch/KILT.
Comment: accepted at NAACL 2021
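A small sketch of the dual evaluation the benchmark emphasizes, scoring both the downstream answer and whether the predicted provenance overlaps the gold Wikipedia pages; the record layout is a simplified stand-in, not KILT's exact schema or metrics.

    # Simplified sketch of scoring answers plus provenance; not KILT's exact schema.
    def evaluate(predictions, gold):
        correct_answer, correct_provenance = 0, 0
        for pred, ref in zip(predictions, gold):
            if pred["answer"].strip().lower() == ref["answer"].strip().lower():
                correct_answer += 1
            if set(pred["provenance"]) & set(ref["provenance"]):
                correct_provenance += 1
        n = len(gold)
        return correct_answer / n, correct_provenance / n

    gold = [{"answer": "Paris", "provenance": {"wiki:France"}}]
    preds = [{"answer": "paris", "provenance": {"wiki:Paris", "wiki:France"}}]
    print(evaluate(preds, gold))  # (1.0, 1.0)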
FinGPT: Large Generative Models for a Small Language
Large language models (LLMs) excel in many tasks in NLP and beyond, but most
open models have very limited coverage of smaller languages and LLM work tends
to focus on languages where nearly unlimited data is available for pretraining.
In this work, we study the challenges of creating LLMs for Finnish, a language
spoken by less than 0.1% of the world population. We compile an extensive
dataset of Finnish combining web crawls, news, social media and eBooks. We
pursue two approaches to pretrain models: 1) we train seven monolingual models
from scratch (186M to 13B parameters) dubbed FinGPT, 2) we continue the
pretraining of the multilingual BLOOM model on a mix of its original training
data and Finnish, resulting in a 176 billion parameter model we call BLUUMI.
For model evaluation, we introduce FIN-bench, a version of BIG-bench with
Finnish tasks. We also assess other model qualities such as toxicity and bias.
Our models and tools are openly available at https://turkunlp.org/gpt3-finnish.
Comment: 17 pages (10 main), 7 figures, 5 tables
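A rough sketch, using standard Hugging Face APIs, of the continued-pretraining recipe behind BLUUMI: interleave the original pretraining mix with Finnish text and keep training a causal LM. The model checkpoint, data files, mixing ratio, and training arguments are placeholders, not the paper's setup.

    # Sketch of continued pretraining on a data mixture; names and ratios are placeholders.
    from datasets import load_dataset, interleave_datasets
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    tok = AutoTokenizer.from_pretrained("bigscience/bloom-560m")   # small stand-in checkpoint
    model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

    finnish = load_dataset("text", data_files="finnish_corpus.txt", split="train")
    original = load_dataset("text", data_files="original_mix.txt", split="train")
    mixed = interleave_datasets([original, finnish], probabilities=[0.5, 0.5], seed=0)
    mixed = mixed.map(lambda b: tok(b["text"], truncation=True, max_length=512),
                      batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="bluumi-style-run", per_device_train_batch_size=1),
        train_dataset=mixed,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    )
    trainer.train()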